Q and A
Define data mining. | ||||||||||
Data mining is the technique of extracting hidden, previously unknown knowledge from a large volume of data. | ||||||||||
What is data pre-processing in data mining? | ||||||||||
Data pre-processing is the process of preparing data for a data mining task. It removes redundancies from the data, makes the data consistent and error-free, transforms the data into the format required by the data mining algorithms, and reduces the size of the data. | ||||||||||
What are the methods of data transformation? | ||||||||||
| ||||||||||
What are the methods for data reduction? | ||||||||||
| ||||||||||
What is smoothing? | ||||||||||
Smoothing is a technique for removing noise from data so that the values become more consistent. Binning by bin means and binning by bin boundaries are smoothing techniques. | ||||||||||
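As an illustration of the two binning techniques, the following minimal sketch sorts the values, splits them into equal-frequency bins, and then replaces each value either by its bin mean or by the nearer bin boundary. The function names and the sample values are assumptions made for this example.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into bins of bin_size values each."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


values = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sample data
bins = equal_frequency_bins(values, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

In this sketch a value exactly halfway between the two boundaries is moved to the lower one; that tie-breaking rule is an arbitrary choice.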
What is a data cube? | ||||||||||
A data cube is a representation of summarized data along several dimensions. It allows data to be viewed and analyzed in multiple dimensions. | ||||||||||
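The idea of summarizing along several dimensions can be sketched in plain Python: for every subset of the dimensions, sum a measure over the remaining ones. The row layout, the dimension values, and the function name `build_cube` are assumptions made for this example; real OLAP servers store and compute cubes far more efficiently.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Sum `measure` for every subset of `dimensions` (one cuboid per subset)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube


# Hypothetical sales records with three dimensions and one measure.
rows = [
    {"time": "Q1", "item": "tv", "location": "north", "units_sold": 10},
    {"time": "Q1", "item": "phone", "location": "north", "units_sold": 20},
    {"time": "Q2", "item": "tv", "location": "south", "units_sold": 5},
    {"time": "Q2", "item": "phone", "location": "north", "units_sold": 15},
]
cube = build_cube(rows, ("time", "item", "location"), "units_sold")
```

Here `cube[("item",)]` gives totals per item, `cube[("time", "location")]` gives totals per time and location, and `cube[()]` holds the grand total.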
What are dimensions and facts in a star schema? | ||||||||||
Dimensions are the objects, entities, or perspectives about which an organization stores data; time, item, and location are examples of dimensions. Facts are numerical measures by which an organization analyzes relationships between dimensions; units_sold and total_sold are examples of facts. | ||||||||||
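A small sketch of this layout, using Python's built-in `sqlite3` module: two dimension tables surround one fact table, and a query analyzes the fact `units_sold` by the item dimension. Every table name, every column name other than `units_sold`, and all sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables describe the perspectives (item, location).
    CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, region TEXT);

    -- The fact table holds the numerical measure plus keys to each dimension.
    CREATE TABLE fact_sales (
        item_id INTEGER REFERENCES dim_item(item_id),
        location_id INTEGER REFERENCES dim_location(location_id),
        units_sold INTEGER
    );

    INSERT INTO dim_item VALUES (1, 'tv'), (2, 'phone');
    INSERT INTO dim_location VALUES (1, 'north'), (2, 'south');
    INSERT INTO fact_sales VALUES (1, 1, 10), (2, 1, 20), (1, 2, 5);
""")

# Analyze the fact along the item dimension.
rows = conn.execute("""
    SELECT i.name, SUM(f.units_sold)
    FROM fact_sales AS f
    JOIN dim_item AS i ON i.item_id = f.item_id
    GROUP BY i.name
    ORDER BY i.name
""").fetchall()
```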
What is the advantage of having a data warehouse? | ||||||||||
A data warehouse can provide a competitive advantage because it presents relevant information; it helps to enhance the productivity of the organization because data can be collected quickly and efficiently; it provides a consistent view of customers and helps to improve relationships with them; and it can help to reduce costs because it tracks trends, patterns, and exceptions consistently over long periods. | ||||||||||
What is the three-tier architecture of a data warehouse? | ||||||||||
The bottom tier is the data warehouse server, typically a relational database. The middle tier is an OLAP server, either ROLAP (an extended relational database that maps operations on multidimensional data to standard relational operations) or MOLAP (a special-purpose server that directly implements multidimensional data and operations). The top tier is the user interface, which contains query and reporting tools, analysis tools, and data mining tools. | ||||||||||
What are the major components of a data mining system? | ||||||||||
| ||||||||||
Describe the Apriori algorithm. | ||||||||||
The Apriori algorithm is an iterative approach.
| ||||||||||
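The iteration can be sketched as follows, assuming the usual level-wise formulation: frequent k-itemsets are joined to form candidate (k+1)-itemsets, candidates with an infrequent subset are pruned, and the remaining candidates are counted against the transactions. The function name, the absolute `min_support` count, and the sample transactions are assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = set()
        for a in current:
            for b in current:
                union = a | b
                # Prune step: every (k-1)-subset must itself be frequent.
                if len(union) == k and all(
                    frozenset(sub) in current
                    for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Count step: scan the transactions once for this level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical market-basket data.
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
result = apriori(transactions, min_support=2)
```

The join step here compares every pair of frequent itemsets, which is simple but not optimized; it is adequate only for small examples.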
How can the Apriori algorithm be improved? | ||||||||||
By transaction reduction: a transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets, so it can be left out of subsequent scans. | ||||||||||
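A minimal sketch of transaction reduction, assuming the frequent k-itemsets have already been found: only transactions that contain at least one of them are kept for the next scan. The function name and the sample data are hypothetical.

```python
def reduce_transactions(transactions, frequent_k_itemsets):
    """Keep only transactions that contain at least one frequent k-itemset."""
    return [
        t for t in transactions
        if any(itemset <= set(t) for itemset in frequent_k_itemsets)
    ]


# Hypothetical data: the frequent 2-itemsets found in an earlier pass.
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
frequent_2 = [frozenset(["bread", "milk"]), frozenset(["bread", "butter"])]
remaining = reduce_transactions(transactions, frequent_2)
```

The transaction `["milk"]` is dropped because it contains no frequent 2-itemset and therefore cannot contribute to any 3-itemset.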
Explain the Frequent Pattern (FP)-growth algorithm. | ||||||||||
FP-growth adopts a divide-and-conquer strategy. First, it compresses the frequent items of the database into a frequent-pattern tree (FP-tree), which retains the itemset association information. Second, it derives a set of conditional databases from the FP-tree, each associated with one frequent item or pattern fragment, and mines each conditional database separately. | ||||||||||
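The two steps can be sketched compactly, under some simplifying assumptions: each conditional database is represented as a list of (prefix path, count) pairs, the tree is rebuilt for every conditional database, and support is an absolute count. The class and function names are hypothetical, and the sketch favors readability over the efficiency of a production implementation.

```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item, its count, and links to parent/children."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_tree(weighted_transactions, min_support):
    """Build an FP-tree; return its header table and frequent item counts."""
    counts = defaultdict(int)
    for items, weight in weighted_transactions:
        for item in set(items):
            counts[item] += weight
    frequent = {i: c for i, c in counts.items() if c >= min_support}

    root = Node(None, None)
    header = defaultdict(list)  # item -> all tree nodes holding that item
    for items, weight in weighted_transactions:
        # Insert frequent items in descending order of support.
        ordered = sorted(
            (i for i in set(items) if i in frequent),
            key=lambda i: (-frequent[i], i),
        )
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += weight
            node = child
    return header, frequent


def fp_growth(weighted_transactions, min_support, suffix=()):
    """Return {sorted item tuple: support count} for all frequent patterns."""
    header, frequent = build_tree(weighted_transactions, min_support)
    patterns = {}
    for item, nodes in header.items():
        patterns[tuple(sorted(suffix + (item,)))] = frequent[item]
        # Conditional database: the prefix path of every node for this item.
        conditional = []
        for node in nodes:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.append((path, node.count))
        patterns.update(fp_growth(conditional, min_support, suffix + (item,)))
    return patterns


# Hypothetical market-basket data; every transaction starts with weight 1.
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
patterns = fp_growth([(t, 1) for t in transactions], min_support=2)
```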
What kinds of data can be mined? | ||||||||||
Data mining can be performed on a number of different data repositories:
| ||||||||||
What is a transactional database? | ||||||||||
A transaction includes a transaction identifier and a list of items, such as the products purchased in a store. A transactional database consists of a file in which each record represents a transaction. | ||||||||||
What kinds of patterns can be mined? | ||||||||||
Data mining tasks can be classified into two categories: descriptive and predictive. Descriptive mining characterizes the general properties of the data. Predictive mining performs inference on the current data to make predictions. Data mining functionalities are:
| ||||||||||
Are all the patterns generated by data mining tasks interesting? | ||||||||||
Not necessarily. The following questions need to be answered:
| ||||||||||
How is the interestingness of a pattern measured? | ||||||||||
Two measures are used to validate the interestingness of patterns:
| ||||||||||
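For association rules, the two measures most commonly used are support and confidence; the sketch below assumes those are the two intended here. Support of a rule A => B is the fraction of transactions containing both A and B, and confidence is the fraction of transactions containing A that also contain B. The function names and the sample data are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market-basket data.
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
```

For the rule bread => milk, two of the four transactions contain both items, and two of the three transactions with bread also contain milk.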
What is data cleaning? | ||||||||||
Data cleaning is the process of filling in missing values, smoothing out noise, and correcting inconsistencies in the data. | ||||||||||
How can missing values be filled? | ||||||||||
| ||||||||||
What is noise? | ||||||||||
Noise is a random error or variance in a variable. | ||||||||||
How is noise smoothed out of numeric data? | ||||||||||
Binning methods, regression analysis, and clustering can be used to smooth out noise from numeric data. | ||||||||||
How can redundancies in data be detected? | ||||||||||
Correlation analysis can be used to detect some redundancies. If two attributes are strongly correlated, so that growth or decline in one is consistently accompanied by growth or decline in the other, then one of the attributes can be considered redundant. | ||||||||||
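For numeric attributes, one common way to quantify this is the Pearson correlation coefficient, used here as an assumption about which correlation measure is meant. A value close to +1 or -1 suggests that one attribute can largely be derived from the other. The function name and the sample attribute values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical attributes: the second is exactly twice the first.
height_cm = [150.0, 160.0, 170.0, 180.0]
height_doubled = [300.0, 320.0, 340.0, 360.0]
r_same = pearson(height_cm, height_doubled)
r_opposite = pearson(height_cm, list(reversed(height_doubled)))
```

The function divides by zero if either attribute is constant, so a real pipeline would need to handle that case.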
What is data transformation? | ||||||||||
Data transformation is the process of transforming data into
a form appropriate for data mining. Some data
transformation techniques are:
| ||||||||||
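Normalization is one widely used transformation and is taken here as an assumed example. The sketch shows min-max normalization, which rescales values into a target range, and z-score normalization, which centers values on the mean and scales by the standard deviation. The function names and the sample values are hypothetical.

```python
from math import sqrt


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    low, high = min(values), max(values)
    if high == low:
        return [new_min for _ in values]  # all values equal: nothing to scale
    scale = (new_max - new_min) / (high - low)
    return [new_min + (v - low) * scale for v in values]


def z_score_normalize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = sum(values) / len(values)
    std = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


incomes = [10.0, 20.0, 30.0]  # hypothetical attribute values
scaled = min_max_normalize(incomes)
standardized = z_score_normalize(incomes)
```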
What is data reduction? | ||||||||||
Data reduction is the process of generating a reduced
representation of the data set without losing the integrity or meaning
of the original data. Data can be reduced using the following techniques:
| ||||||||||
What is classification? | ||||||||||
Classification is the process of assigning data objects (elements) to predefined classes based on their attributes. A model is learned from training objects whose class labels are known and is then used to predict the class of new objects. Decision trees, neural networks, and Bayesian classifiers are classification algorithms. | ||||||||||
What is association analysis? | ||||||||||
Association analysis is the process of finding relations or associations between objects (elements), where the presence of one object influences the presence of another. "If A is purchased, then X will also be purchased" is an example of an association. Market basket analysis is an example of association analysis. The Apriori algorithm and the Frequent Pattern tree (FP-tree) approach are association analysis algorithms. | ||||||||||
What is clustering? | ||||||||||
Clustering is the grouping of objects based on the closeness between them. Clustering differs from classification in that clustering is an unsupervised process, whereas classification is a supervised process. K-means, K-medoids, DBSCAN, agglomerative, and divisive are clustering algorithms. | ||||||||||
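As a minimal sketch of one of these algorithms, K-means alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The initial centroids are passed in explicitly to keep the example deterministic; the function names and the sample points are assumptions made for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iterations=100):
    """Return (labels, centroids) after running K-means to convergence."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iterations):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
labels, centroids = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
```

No class labels are supplied anywhere, which is what makes the process unsupervised.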
What are the processes involved in data warehousing? | ||||||||||
The processes involved in data warehousing are:
| ||||||||||
What are the components of a data warehouse for handling data warehousing processes? | ||||||||||
| ||||||||||
How is the database size calculated for setting up a data warehouse? | ||||||||||
The following entities are included when calculating the size required for setting up a data warehouse:
| ||||||||||
Why are aggregate tables created in a data warehouse? | ||||||||||
Aggregate tables are created to speed up query response time. Users ask many queries, and computing summaries on the fly for each one is time-consuming, so users would have to wait a long time for their responses. If an aggregate table matches the user's query, the user gets an immediate response. | ||||||||||
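The idea can be sketched with Python's built-in `sqlite3` module: a summary table is computed once from the detailed sales rows, and later queries read the pre-computed totals instead of scanning the detail table. The table names, column names, and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Detailed data: one row per sale.
    CREATE TABLE sales (sale_date TEXT, item TEXT, units_sold INTEGER);
    INSERT INTO sales VALUES
        ('2024-01-05', 'tv', 10),
        ('2024-01-20', 'tv', 5),
        ('2024-02-03', 'tv', 7),
        ('2024-02-10', 'phone', 20);

    -- Aggregate table: monthly totals per item, computed once in advance.
    CREATE TABLE monthly_item_sales AS
        SELECT substr(sale_date, 1, 7) AS month,
               item,
               SUM(units_sold) AS units_sold
        FROM sales
        GROUP BY substr(sale_date, 1, 7), item;
""")

# A query that matches the aggregate reads one pre-computed row.
january_tv = conn.execute(
    "SELECT units_sold FROM monthly_item_sales WHERE month = ? AND item = ?",
    ("2024-01", "tv"),
).fetchone()[0]
```

The trade-off is that the aggregate table must be refreshed whenever the detailed data changes.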
What is a Bayesian classifier? | ||||||||||
A Bayesian classifier is a statistical classifier that can predict the probability of a tuple belonging to a particular class. It is based on Bayes' theorem. The naive Bayesian classifier is called naive because it assumes that the effect of one attribute on a class is independent of the values of the other attributes. | ||||||||||
How does a Bayesian classifier work? | ||||||||||
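A minimal sketch of a naive Bayesian classifier for categorical attributes: the score of each class is its prior probability multiplied by the conditional probability of every attribute value given that class, relying on the independence assumption. No smoothing is applied, so an attribute value never seen with a class yields a score of zero. The function names and the training data are hypothetical.

```python
from collections import Counter, defaultdict


def train(samples):
    """Count classes and, per (class, attribute), the observed values."""
    class_counts = Counter(label for _, label in samples)
    value_counts = defaultdict(Counter)
    for attributes, label in samples:
        for name, value in attributes.items():
            value_counts[(label, name)][value] += 1
    return class_counts, value_counts


def class_scores(model, attributes):
    """Return P(class) * product of P(value | class) for every class."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior probability of the class
        for name, value in attributes.items():
            score *= value_counts[(label, name)][value] / count
        scores[label] = score
    return scores


# Hypothetical training tuples: (attributes, class label).
samples = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rainy", "windy": "no"}, "play"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
]
model = train(samples)
scores = class_scores(model, {"outlook": "sunny", "windy": "no"})
predicted = max(scores, key=scores.get)
```

The scores are proportional to the posterior probabilities; dividing each by their sum would give the probabilities themselves.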
| ||||||||||
What is backpropagation? | ||||||||||
Backpropagation is a neural network learning algorithm. A neural
network is a set of connected input/output units in which each
connection has a weight associated with it. The network learns by adjusting
the weights, which is why this is called connectionist learning. High tolerance of noisy data and the ability to classify patterns on which it has not been trained are advantages of neural networks. | ||||||||||
How does backpropagation work? | ||||||||||
The error between the target value and the predicted value is calculated; this error is then propagated backwards, from the output layer to the hidden layer and from the hidden layer to the input layer, to adjust the weights of the connections. A neural network whose weights are adjusted in this backward direction is called a backpropagation neural network. | ||||||||||
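The backward pass can be sketched for a very small network with two inputs, two hidden units, and one output. The sketch assumes sigmoid units, a squared-error objective, no bias terms, and a single training example; the initial weights, the learning rate, and all names are hypothetical choices made to keep the example short.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, w_output):
    """Compute the hidden activations and the network output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_output, hidden)))
    return hidden, output


def train_step(x, target, w_hidden, w_output, rate=0.5):
    """One backpropagation update; returns the new weights."""
    hidden, output = forward(x, w_hidden, w_output)
    # Error term at the output unit.
    delta_out = (target - output) * output * (1 - output)
    # Propagate the error backwards to each hidden unit.
    delta_hidden = [
        delta_out * w_output[j] * hidden[j] * (1 - hidden[j])
        for j in range(len(hidden))
    ]
    # Adjust the weights of the connections in proportion to the error terms.
    new_w_output = [
        w_output[j] + rate * delta_out * hidden[j] for j in range(len(hidden))
    ]
    new_w_hidden = [
        [w_hidden[j][i] + rate * delta_hidden[j] * x[i] for i in range(len(x))]
        for j in range(len(hidden))
    ]
    return new_w_hidden, new_w_output


x, target = [1.0, 0.0], 1.0  # one hypothetical training example
w_hidden = [[0.1, -0.2], [0.3, 0.4]]
w_output = [0.2, -0.1]
_, before = forward(x, w_hidden, w_output)
for _ in range(200):
    w_hidden, w_output = train_step(x, target, w_hidden, w_output)
_, after = forward(x, w_hidden, w_output)
```

After the updates, the output `after` lies closer to the target than the initial output `before`.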
What is a decision tree? | ||||||||||
A decision tree is a tree structure made up of internal nodes, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. The topmost node in the tree is the root node. | ||||||||||
How is a decision tree used for classification? | ||||||||||
The attribute values of a tuple X with unknown class are tested against the decision tree; a path is traced from the root to a leaf; the leaf node provides the class label for X. |
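The tracing of a path from root to leaf can be sketched with a tree stored as nested dictionaries: an internal node names the attribute to test and maps each outcome to a child, and a leaf holds a class label. The dictionary layout, the function name, and the sample tree are assumptions made for this example; the tree is written by hand rather than learned from data.

```python
def classify(node, tuple_x):
    """Follow the tests from the root until a leaf is reached."""
    while "label" not in node:
        outcome = tuple_x[node["attribute"]]  # test the attribute at this node
        node = node["branches"][outcome]      # follow the matching branch
    return node["label"]


# Hypothetical hand-built tree: the root tests "outlook".
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {
                "yes": {"label": "stay"},
                "no": {"label": "play"},
            },
        },
        "rainy": {"label": "stay"},
    },
}
label_a = classify(tree, {"outlook": "sunny", "windy": "no"})
label_b = classify(tree, {"outlook": "rainy", "windy": "no"})
```

An attribute value with no matching branch raises a `KeyError` in this sketch; a fuller implementation would need a fallback such as a default class.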